Memory Efficient De Bruijn Graph Construction
Authors
Abstract
Massively parallel DNA sequencing technologies are revolutionizing genomics research. Billions of short reads, generated at low cost, can be assembled to reconstruct whole genomes. Unfortunately, the large memory footprint of existing de novo assembly algorithms makes it challenging to complete the assembly for higher eukaryotes such as mammals. In this work, we investigate the memory cost of constructing the de Bruijn graph, a core task in leading assembly algorithms that often consumes several hundred gigabytes of memory for large genomes. We propose a disk-based partitioning method, called Minimum Substring Partitioning (MSP), that completes the task using less than 10 gigabytes of memory, without runtime slowdown. MSP breaks the short reads into multiple small disjoint partitions so that each partition can be loaded into memory, processed individually, and later merged with the others to form a de Bruijn graph. By exploiting the overlaps among the k-mers (substrings of length k), MSP achieves a high compression ratio: the total size of the partitions is reduced from Θ(kn) to Θ(n), where n is the size of the short read database and k is the length of a k-mer. Experimental results show that our method can build de Bruijn graphs on a commodity computer for any large-volume sequence dataset. Source code and datasets: grafia.cs.ucsb.edu/msp
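The abstract does not spell out the partitioning rule, so the following is only a minimal sketch of one way minimum-substring partitioning can work. It assumes that each k-mer is keyed by its lexicographically smallest length-p substring, that consecutive k-mers sharing the same key are stored together as one merged string, and that the key is hashed to pick a partition. The function names, the parameter `p`, and the CRC32 hash are illustrative choices, not taken from the authors' code, and reverse complements are ignored.

```python
import zlib
from collections import defaultdict


def min_substring(kmer, p):
    """Return the lexicographically smallest length-p substring of kmer."""
    return min(kmer[i:i + p] for i in range(len(kmer) - p + 1))


def super_kmers(read, k, p):
    """Yield (key, merged string) pairs for one read.

    Consecutive k-mers that share the same minimum p-substring (the key)
    are merged into a single string, so their k-1 overlapping characters
    are stored once rather than once per k-mer.
    """
    if len(read) < k:
        return
    start = 0
    current = min_substring(read[:k], p)
    for i in range(1, len(read) - k + 1):
        key = min_substring(read[i:i + k], p)
        if key != current:
            # The run covers k-mers start..i-1; the last one ends at i-1+k.
            yield current, read[start:i - 1 + k]
            start, current = i, key
    yield current, read[start:]


def partition_reads(reads, k, p, num_partitions):
    """Distribute merged strings into partitions by hashing their key."""
    partitions = defaultdict(list)
    for read in reads:
        for key, merged in super_kmers(read, k, p):
            # Assumption: a deterministic hash so equal keys always collide.
            bucket = zlib.crc32(key.encode("ascii")) % num_partitions
            partitions[bucket].append(merged)
    return partitions


if __name__ == "__main__":
    for key, merged in super_kmers("ACGTACGTTGCA", k=5, p=2):
        print(key, merged)
```

Under these assumptions, identical k-mers always share a key and therefore land in the same partition, which is what allows each partition to be processed independently. The saving comes from the merged strings: a run of m adjacent k-mers is stored in m + k - 1 characters instead of mk.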
Similar resources
HaVec: An Efficient de Bruijn Graph Construction Algorithm for Genome Assembly
BACKGROUND The rapid advancement of sequencing technologies has made it possible to regularly produce millions of high-quality reads from the DNA samples in the sequencing laboratories. To this end, the de Bruijn graph is a popular data structure in the genome assembly literature for efficient representation and processing of data. Due to the number of nodes in a de Bruijn graph, the main barri...
On Binary de Bruijn Sequences from LFSRs with Arbitrary Characteristic Polynomials
We propose a construction of de Bruijn sequences by the cycle joining method from linear feedback shift registers (LFSRs) with arbitrary characteristic polynomial f(x). We study in detail the cycle structure of the set Ω(f(x)) that contains all sequences produced by a specific LFSR on distinct inputs and provide an efficient way to find a state of each cycle. Our structural results lead to an e...
TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes
Motivation de Bruijn graphs have been proposed as a data structure to facilitate the analysis of related whole genome sequences, in both a population and comparative genomic settings. However, current approaches do not scale well to many genomes of large size (such as mammalian genomes). Results In this article, we present TwoPaCo, a simple and scalable low memory algorithm for the direct con...
De Bruijn Graph Homomorphisms and Recursive De Bruijn Sequences
This paper presents a method to find new de Bruijn cycles based on ones of lesser order. This is done by mapping a de Bruijn cycle to several vertex disjoint cycles in a de Bruijn digraph of higher order and connecting these cycles into one full cycle. We characterize homomorphisms between de Bruijn digraphs of different orders that allow this construction. These maps generalize the well-known ...
A method for constructing decodable de Bruijn sequences
In this paper we present two related methods of construction for de Bruijn sequences, both based on interleaving “smaller” de Bruijn sequences. Sequences obtained using these construction methods have the advantage that they can be “decoded” very efficiently, i.e., the position within the sequence of any particular “window” can be found very simply. Sequences with simple decoding algorithms are...
Journal: CoRR
Volume: abs/1207.3532
Publication date: 2012